System and method of performing automatic speech recognition using local private data

ABSTRACT

A method of providing hybrid speech recognition between a local embedded speech recognition system and a remote speech recognition system includes receiving speech from a user at a device communicating with the remote speech recognition system. The system recognizes a first part of the speech by performing a first recognition of the first part of the speech with the embedded speech recognition system, which accesses private user data that is not available to the remote speech recognition system. The system recognizes a second part of the speech by performing a second recognition of the second part of the speech with the remote speech recognition system. The final recognition result is a combination of these two recognition processes. The private data can include local information such as a user location, a playlist, frequently dialed numbers or frequently texted people, user contact list information, and so forth.

PRIORITY INFORMATION

The present application is a continuation of U.S. patent application Ser. No. 15/606,477, filed May 26, 2017, which is a continuation of U.S. patent application Ser. No. 14/066,079, filed Oct. 29, 2013, now U.S. Pat. No. 9,666,188, issued May 30, 2017, the contents of which are incorporated herein by reference in their entirety.

BACKGROUND

1. Field of the Disclosure

The present disclosure relates to automatic speech recognition and more particularly to a system and method of performing automatic speech recognition using an embedded local automatic speech recognition system that uses private user data together with a remote network-based automatic speech recognition system.

2. Introduction

Some auto manufacturers have indicated a desire to provide a virtual assistant capability using a network speech recognizer. A vehicle or other mobile device is often, but not always, connected to a network such as the Internet or a cellular network. When such a device is not connected to a network, it should still provide automatic speech recognition that performs as closely as possible to a recognizer with network capabilities. As is known in the art, a local speech recognition system, whether in an automobile or on a mobile device, may not have as much computing power as a network-based automatic speech recognition system. Accordingly, the results obtained from a local automatic speech recognition system will typically be inferior to those of recognition performed in the network.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to describe the manner in which the above-recited and other advantages and features of the disclosure can be obtained, a more particular description of the disclosure briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only exemplary embodiments and are not therefore to be considered to be limiting of its scope, the concepts will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:

FIG. 1 illustrates an example system embodiment;

FIG. 2 illustrates a two-recognizer solution;

FIG. 3 illustrates another embodiment in which an embedded speech recognizer coordinates with a remote speech recognizer;

FIG. 4 illustrates a method embodiment; and

FIG. 5 illustrates another method embodiment.

DETAILED DESCRIPTION

Various embodiments of the disclosure are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without departing from the spirit and scope of the disclosure.

The present disclosure addresses a need in the art to perform automatic speech recognition by coordinating a speech recognition task between a local embedded speech recognition system and a remote or network-based speech recognition system in a way that can both use and protect private data. For example, a user may have private data on a local device that the user does not want shared in a network. Such data can include a user's contact list, frequently dialed numbers, a user location, a user's music or video playlist, and so on. However, such local private information may be useful for automatic speech recognition, because the user may, in voicing a command or an instruction, use a friend's name, a street name, an artist name, or a song title. Private information would therefore help speech recognition, and the mechanism disclosed herein enables such information to be used for automatic speech recognition while keeping it private, so that it is not transmitted into the network for use by a network-based speech recognition system.

Before proceeding with the discussion of the present disclosure, FIG. 1 provides a brief introductory description of a basic general-purpose system or computing device that can be employed to practice the concepts disclosed herein. A more detailed description of the concepts of this disclosure will then follow. The disclosure now turns to FIG. 1.

With reference to FIG. 1, an exemplary system includes a general-purpose computing device 100, including a processing unit (CPU) 120 and a system bus 110 that couples various system components, including the system memory such as read only memory (ROM) 140 and random access memory (RAM) 150, to the processing unit 120. Other system memory 130 may be available for use as well. It can be appreciated that the concepts can operate on a computing device with more than one CPU 120 or on a group or cluster of computing devices networked together to provide greater processing capability. The system bus 110 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. A basic input/output system (BIOS) stored in ROM 140 or the like may provide the basic routine that helps to transfer information between elements within the computing device 100, such as during start-up. The computing device 100 further includes storage devices such as a hard disk drive 160, a magnetic disk drive, an optical disk drive, a tape drive, or the like. The storage device 160 is connected to the system bus 110 by a drive interface. The drives and the associated computer readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the computing device 100. The basic components are known to those of skill in the art, and appropriate variations are contemplated depending on the type of device, such as whether the device is a small, handheld computing device, a desktop computer, or a computer server.

Although the exemplary environment described herein employs the hard disk, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory, digital versatile disks, cartridges, random access memories (RAMs), read only memory (ROM), a cable or wireless signal containing a bit stream and the like, may also be used in the exemplary operating environment.

To enable user interaction with the computing device 100, an input device 190 represents any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, a keyboard, a mouse, motion input, speech and so forth. The input may be used by the presenter to indicate the beginning of a speech search query. The device output 170 can also be one or more of a number of output mechanisms known to those of skill in the art. In some instances, multimodal systems enable a user to provide multiple types of input to communicate with the computing device 100. The communications interface 180 generally governs and manages the user input and system output. There is no restriction on the concepts disclosed herein operating on any particular hardware arrangement, and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.

For clarity of explanation, the illustrative system embodiment is presented as comprising individual functional blocks (including functional blocks labeled as a “processor”). The functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software. For example, the functions of one or more processors presented in FIG. 1 may be provided by a single shared processor or multiple processors. (Use of the term “processor” should not be construed to refer exclusively to hardware capable of executing software.) Illustrative embodiments may comprise microprocessor and/or digital signal processor (DSP) hardware, read-only memory (ROM) for storing software performing the operations discussed below, and random access memory (RAM) for storing results. Very large scale integration (VLSI) hardware embodiments, as well as custom VLSI circuitry in combination with a general purpose DSP circuit, may also be provided.

The disclosure now turns to more particular features of the solution. In one aspect, the proposed solution is a hybrid recognition arrangement with an embedded recognizer in a car or on another mobile device and a network recognizer in “the cloud.” FIG. 2 illustrates such an arrangement. Although the system 200 is shown including a car 202, the same structure can also be provided on a mobile device or other device with the same or similar functionality. As shown in FIG. 2, the system includes a software application 208, an application programming interface (API) 206 and an embedded recognizer 204. Other features include a microphone 210 and a speaker 212. The API 206 communicates via a network 216 with a network-based speech recognition system 214. The network recognizer 214 has the benefit of more central processing power, more memory and better network connectivity. The embedded recognizer 204 can implement a speech recognition model that can access private user data.

Both recognizers can be used serially or in parallel to arrive at a recognition result. At least some user information is available to the embedded recognizer 204 and not to the network recognizer 214. Recognizer 214 could also simply be remote from recognizer 204, physically or virtually. For example, a local device (which can be a separate device such as a smartphone or mobile device, or the combination of features 204, 206 and 208 in FIG. 2) can additionally include a memory that stores data such as a user's contact list, frequently dialed numbers and called/texted names, user location, a user's music or video playlist including frequency of use, and other private data. Private data may also include a user's history, including audio or text from previous interactions with the system, previous locations, etc., or data derived from the user's history such as a record of frequently used words or acoustic parameters of the user's voice. A user's private contact list can be used by voicemail transcription, voice dialing, and short messaging service messaging to determine the recipient of a message. The contact list may be used to narrow down the list of possible names recognized, to more accurately determine which name was spoken, and to determine the correct spelling. A database of frequently dialed numbers and frequently called or texted/emailed names could be used to refine candidate probabilities when recognizing the spoken name or number in a speech command from the user. A user location could be used in a navigation or point-of-interest service to narrow down the list of street names, cities, businesses or other points of interest according to the location of the user, a direction of travel, or a direction the user is facing. The various pieces of data associated with private interactions set forth above are not meant to be an exhaustive list.

There also may be data saved from the status of a video game the user has been playing, scores or names in fantasy football tournaments the user has played or is playing, or downloaded articles from news sources that are stored locally. Thus, private user data is broad enough to encompass any such data beyond just data stored in a contacts list. Private data may be stored locally, in a network, or derived from multiple local and remote sources.
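The use of a contact list and call/text frequencies to narrow and re-rank name candidates can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: it assumes the recognizer exposes candidate names with log-domain scores, and the function name `rescore_names`, the add-one smoothing, and the `prior_weight` parameter are hypothetical choices.

```python
import math
from typing import Dict, List, Set, Tuple


def rescore_names(
    candidates: List[Tuple[str, float]],
    contacts: Set[str],
    call_counts: Dict[str, int],
    prior_weight: float = 1.0,
) -> List[Tuple[str, float]]:
    """Narrow and re-rank (name, log_score) candidates using private data.

    Hypothetical helper: names not in the contact list are dropped, and the
    remaining ones get a log prior from add-one-smoothed call/text counts.
    """
    # Add-one smoothing so contacts never called still get a small prior.
    total = sum(call_counts.values()) + len(contacts)
    rescored = []
    for name, log_score in candidates:
        if name not in contacts:
            continue  # narrow the list to names the user actually knows
        prior = (call_counts.get(name, 0) + 1) / total
        rescored.append((name, log_score + prior_weight * math.log(prior)))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

Dropping non-contacts outright mirrors the "narrow down" step; a softer variant could apply a penalty instead, so that names outside the contact list remain recognizable.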

With a music or video playlist, data such as the artist, track title, album title and so forth could be used to recognize requests such as “play something by the Beatles” or “what movies do I have starring Harrison Ford?” Knowing the user's listening habits helps to guide recognition toward the most likely candidates. In another aspect, the network speech recognition system 214 can receive a speech request and return multiple hypotheses, each making different assumptions. For example, the mobile system may report three possible locations, and the network recognizer 214 could return three corresponding results. The embedded recognizer 204 can then use private user information or some other criteria to select the most likely assumption or set of assumptions and the associated hypothesis. In this regard, using local private user data may help improve the speech processing. The embedded speech recognizer 204 may also operate alone when the network speech recognizer 214 is unavailable, such as when there is no connection to the network.
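Selecting among network hypotheses that each rest on a different assumption could look like the sketch below. It assumes, hypothetically, that each hypothesis carries the location it assumed and a network score, and that the device keeps a private count of visits per location; the names `Hypothesis`, `select_hypothesis`, and `visit_counts` are illustrative only.

```python
from typing import Dict, List, TypedDict


class Hypothesis(TypedDict):
    assumption: str  # e.g. the location the network recognizer assumed
    text: str
    score: float     # network recognizer's confidence


def select_hypothesis(hypotheses: List[Hypothesis],
                      visit_counts: Dict[str, int]) -> Hypothesis:
    """Pick the hypothesis whose assumption best agrees with private data.

    The visit history stays on the device; only the final choice depends on
    it. Ties in visit count fall back to the network score.
    """
    return max(hypotheses,
               key=lambda h: (visit_counts.get(h["assumption"], 0), h["score"]))
```

Ranking lexicographically (private evidence first, network score second) is one simple policy; combining the two as weighted log scores would be an equally plausible alternative.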

Furthermore, the approach disclosed herein may minimize the network bandwidth needed to send information such as user metadata between a device and the network 216. The metadata can include lists of data about the user, such as part or all of an address book, a song catalog, etc. Such data can take up a large amount of bandwidth. Metadata can also refer to any data typically outside of an actual message, such as instructions from a user: “SMS John this message . . . meet me at home.” In this case, the “SMS John this message” portion is metadata outside of the actual message, which is “meet me at home.” Accordingly, the solution disclosed herein can retain the advantages of embedded automatic speech recognition with respect to privacy and operation when the network is unavailable, while also maintaining the benefit of a network-based speech recognition system 214, which has more central processing unit power and access to other on-line resources.

One example, discussed next, is a short messaging service (SMS) application, in which a powerful dictation model that runs best on the network recognizer 214 recognizes a message. The system may also identify the addressee of the message. This data may be taken from metadata rather than from within the message itself. When the name is in the message, the system may need to access the user's contact lists in order to properly recognize the name. The process to recognize the addressee can run on the local embedded recognizer 204. A similar approach can apply to other cases where private information is mixed with dictation from the user, such as a calendar application in which a user may dictate a meeting description along with names of contacts or meeting participants. The system may also analyze the contacts list in order to properly recognize a name that is not fully spelled out in a message. For example, the message may say “DL, are you coming tonight?” The system can analyze the message to determine that “DL” is likely a name, and then learn from the private contacts list that DL is a friend of the sender. Thus, the system can return an identification of the addressee as David Larson.
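The “DL” example can be illustrated with a small on-device lookup. This sketch assumes a hypothetical heuristic (a leading run of two or three capital letters followed by a comma is treated as an abbreviated salutation) and matches it against contact initials; neither the pattern nor the function names come from the disclosure.

```python
import re
from typing import Iterable, List

# Hypothetical heuristic: a leading 2-3 letter all-caps token followed by a
# comma is treated as a salutation that may abbreviate a contact's name.
SALUTATION = re.compile(r"\s*([A-Z]{2,3})\b,")


def initials(name: str) -> str:
    """'David Larson' -> 'DL'."""
    return "".join(part[0] for part in name.split()).upper()


def resolve_initials(message: str, contacts: Iterable[str]) -> List[str]:
    """Return private contacts whose initials match an abbreviated salutation."""
    match = SALUTATION.match(message)
    if match is None:
        return []
    token = match.group(1)
    return [name for name in contacts if initials(name) == token]
```

When several contacts share the same initials, the frequency data described earlier could break the tie; this sketch simply returns all matches.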

FIG. 3 illustrates a more general approach 300 in which a device 302 includes an embedded automatic speech recognizer 304 that communicates 306 through a network 308 with a network-based automatic speech recognizer 310. In the example of an SMS message, assume that a message speech utterance is sent from the device 302. The speech may be sent to another device (not shown), but data associated with the speech is sent 306 to the network-based speech recognition system 310. Assume that the speech utterance contains a user's name, Bartholomew. The network recognizer 310 uses a placeholder in place of the name and returns to the device 302 a lattice, a word confusion matrix, or an n-best list. The placeholder can cover any entity such as a name, location, playlist song, recipient name/phone number, favorite food, birthday or other day, etc. The local embedded system 304 evaluates the network recognizer's output, received 306 through the network 308, in light of user data 312 to select the final answer. A placeholder for names may be a garbage model, a phonemic language model, a language model based on a standard list of names, a context-free grammar, and/or a statistical language model built from the user's contact list in which the individual names are not extractable. Of course, the contact list is used here as an example; any other kind of data, such as calendar data, playlist data and so forth, could be used. In other words, the placeholder may be for a street name, a current location of the user, and so forth. Using a context-free grammar could also obscure the contacts list, or any other related data, since the grammar would not necessarily show which first names go with which last names.
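The point that a context-free grammar need not reveal which first names go with which last names can be made concrete. The sketch below assumes a two-slot grammar, `<NAME> ::= <FIRST> <LAST>`, with each slot stored as an independent sorted set; the data layout and function names are hypothetical and not a real grammar format such as SRGS or JSGF.

```python
from typing import Dict, Iterable, List


def build_name_grammar(contacts: Iterable[str]) -> Dict[str, List[str]]:
    """Build a two-slot grammar <NAME> ::= <FIRST> <LAST>.

    First and last names are stored as independent sorted sets, so the
    grammar does not record which first name belongs with which last name.
    """
    names = [c.split() for c in contacts if c.strip()]
    return {
        "FIRST": sorted({parts[0] for parts in names}),
        "LAST": sorted({parts[-1] for parts in names}),
    }


def grammar_accepts(grammar: Dict[str, List[str]], phrase: str) -> bool:
    """True if the phrase is a <FIRST> <LAST> pair allowed by the grammar."""
    words = phrase.split()
    return (len(words) == 2
            and words[0] in grammar["FIRST"]
            and words[1] in grammar["LAST"])
```

The grammar accepts every first/last combination, including pairs that are not real contacts. That over-generation is exactly what hides the original pairings, at the cost of a slightly larger search space.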

The name may be part of a message (“Hey, Bartholomew, call me or die.”) or separate from the message, such as part of a preamble (“SMS Bartholomew, call me or die.”). The “SMS Bartholomew” in this example is an instruction to the system to send an SMS message to Bartholomew, and the message that is sent is “call me or die.”
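Telling a preamble recipient apart from a name inside the message can be sketched at the text level. This assumes, hypothetically, that the command follows the pattern “SMS <Name>, <body>” or “text <Name>, <body>” with a one- or two-word capitalized name; a deployed system would more likely rely on the recognizer's intent tagging than on a regular expression.

```python
import re
from typing import Optional, Tuple

# Hypothetical text-level pattern: "SMS <Name>, <body>" or "text <Name>, <body>",
# where <Name> is one or two capitalized words.
PREAMBLE = re.compile(
    r"^\s*(?:SMS|[Tt]ext)\s+"
    r"(?P<recipient>[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)?)\s*,\s*"
    r"(?P<body>.+)$"
)


def split_preamble(utterance: str) -> Tuple[Optional[str], str]:
    """Separate a recipient preamble (metadata) from the message body."""
    match = PREAMBLE.match(utterance)
    if match is None:
        # No preamble: any name is part of the message itself.
        return None, utterance.strip()
    return match.group("recipient"), match.group("body").strip()
```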

In one aspect, the network recognizer 310 recognizes the full utterance it receives 306 and the intent (for example, to send an SMS), and it benefits from the network model for the transcription portions of the data, such as the actual SMS message. For the addressee, the system can use a placeholder such as a garbage model, a large name list, a statistical language model built from the user's contact list, a model based on metadata and/or private data, some other kind of phoneme-level model, and/or a phonemic language model. The network recognizer's output 306 can look something like the following:

“Send SMS to _ADDRESSEE_ [message] I'll see you at 5:00 [\message] [intent=SMS]”

The above result can be sent to the local client device 302 along with word timings. The “intent=SMS” is an intent classification tag and represents an example of how a recognizer may output the intent separately from the transcribed text string. The embedded recognizer 304 can use the word timings to analyze the portion of the locally stored audio that corresponds to _ADDRESSEE_ and run that audio against a local model containing the user's actual contacts to determine the spoken contact name. The contact name is then inserted into the recognized string as follows:

“Send SMS to [name] Bartholomew [\name] [message] I'll see you at 5:00 [\message].”
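The timing-based substitution just described can be sketched as follows. The sketch assumes the network result arrives as a list of `(word, start_seconds, end_seconds)` tokens, that the device holds the raw audio samples, and that `recognize_local` stands in for the embedded recognizer 304 running a contact-list model; all three are hypothetical interfaces, and the `[message]`/`[name]` tags are omitted for brevity.

```python
from typing import Callable, List, Sequence, Tuple

Token = Tuple[str, float, float]  # (word, start_seconds, end_seconds)


def fill_placeholder(
    tokens: List[Token],
    audio: Sequence[float],
    sample_rate: int,
    recognize_local: Callable[[Sequence[float]], str],
    placeholder: str = "_ADDRESSEE_",
) -> str:
    """Replace each placeholder token with a name recognized on-device.

    Only word timings come back from the network, so the samples used for
    name recognition are read from the locally stored recording.
    """
    words = []
    for word, start, end in tokens:
        if word == placeholder:
            first, last = round(start * sample_rate), round(end * sample_rate)
            words.append(recognize_local(audio[first:last]))
        else:
            words.append(word)
    return " ".join(words)
```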

As a variation on the previous solution, the local recognizer 304 can process the entire audio instead of just the part corresponding to _ADDRESSEE_, and then extract a name and insert it into the recognized string.

In an alternative embodiment, both recognizers 304, 310 can return a result, and the outputs (such as the best hypothesis, an n-best list, a word confusion matrix or a lattice) are compared in a ROVER-style fashion, where ROVER means recognition output voting error reduction. The embedded recognizer's answers may be given preferential weight for words derived from the user's contact list or other user data.
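A simplified version of the voting step is sketched below. Full ROVER first aligns the hypotheses into a word transition network by dynamic programming; this sketch assumes that alignment has already been done, so each recognizer supplies one `(word, confidence)` pair per slot. The `user_bonus` weighting for embedded words drawn from user data is a hypothetical way to express the preferential weight described above.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Slot = Tuple[str, float]  # (word, confidence); word "" marks a deletion


def rover_vote(
    hypotheses: Dict[str, List[Slot]],
    user_vocab: Set[str],
    user_bonus: float = 2.0,
) -> List[str]:
    """Confidence-weighted vote over already-aligned hypotheses.

    Words proposed by the "embedded" recognizer that appear in private user
    data are up-weighted by `user_bonus`.
    """
    n_slots = len(next(iter(hypotheses.values())))
    result = []
    for i in range(n_slots):
        scores: Dict[str, float] = defaultdict(float)
        for source, slots in hypotheses.items():
            word, conf = slots[i]
            bonus = user_bonus if source == "embedded" and word in user_vocab else 1.0
            scores[word] += bonus * conf
        best = max(scores, key=scores.get)
        if best:  # skip slots where the winning "word" is a deletion
            result.append(best)
    return result
```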

Another variation could be to assign one portion of the utterance (such as “send an SMS message to John Smith”) to the embedded recognizer 304 and another (“meet me at 5:00”) to the network recognizer 310. A similar approach may be to let one recognizer parse the utterance, recognize one part, and assign the other part to the other recognizer. In another approach, the system may allow the embedded recognizer 304 to send the recognized words that it derived from user data to the network recognizer 310, which then uses this information to generate an output. This exposes some data from the user's spoken request, but not the entire set of user data. Indeed, the data chosen in this case may be generic names that would not provide much information about the user. For example, names like “Mark” or “Matt” might be sent, whereas more distinctive names such as “Kenya” may not be sent, inasmuch as they are deemed unique enough to be more specific to the user and thus reveal too much private information.
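One way to decide which recognized names are generic enough to share is a frequency threshold against a public name list. In the sketch below, `public_freq`, the threshold `min_freq`, and the frequency values in the usage example are all hypothetical and illustrative, not real population statistics.

```python
from typing import Dict, List


def shareable_names(names: List[str], public_freq: Dict[str, float],
                    min_freq: float = 1e-3) -> List[str]:
    """Keep only names common enough that sharing them reveals little.

    `public_freq` maps a lower-cased first name to its relative frequency in
    a public (non-private) name list; unknown names are treated as rare.
    """
    return [n for n in names if public_freq.get(n.lower(), 0.0) >= min_freq]


# Illustrative values only.
freq = {"mark": 0.004, "matt": 0.002, "kenya": 0.00002}
to_send = shareable_names(["Mark", "Matt", "Kenya"], freq)
```

Treating unknown names as rare errs on the side of privacy: a name absent from the public list is never sent.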

In another aspect, the system can run both the embedded recognizer 304 and the network recognizer 310 and then use the output from one or the other based on the recognized user intent, confidence scores from one or both recognizers, or other criteria. The embedded recognizer 304 may take control for simple utterances that require user metadata, and the network recognizer 310 can process utterances that require high accuracy but not private information. For example, if the user dictates an email message, the network recognizer's output 306 is used. If the user speaks a car control function (“turn up the heat”), then the output generated by the embedded recognizer 304 is kept local on the device 302, and neither the audio nor the recognized result is transmitted 306 to the network-based server 310.
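The selection between the two outputs could be sketched as below. The intent labels in `LOCAL_INTENTS`, the `min_conf` threshold, and the `Result` record are hypothetical. The sketch only covers choosing which output to use; keeping car-control audio entirely off the network would additionally require classifying the request before anything is sent.

```python
from typing import Optional, Tuple, TypedDict


class Result(TypedDict):
    intent: str
    text: str
    confidence: float


# Hypothetical intents that need private data or should stay on the device.
LOCAL_INTENTS = {"car_control", "voice_dial"}


def choose_result(embedded: Result, network: Optional[Result],
                  min_conf: float = 0.5) -> Tuple[str, str]:
    """Return (source, text) after both recognizers have run."""
    if network is None:
        return "embedded", embedded["text"]  # network unavailable
    if embedded["intent"] in LOCAL_INTENTS and embedded["confidence"] >= min_conf:
        return "embedded", embedded["text"]  # keep the result on the device
    return "network", network["text"]        # e.g. dictation
```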

In another example, the system can pinch the phonetic lattice for the portion of the utterance that carries a tag or a likely name. A sequence matching algorithm such as the Smith-Waterman algorithm, which can take into account a cost matrix based on a priori confusability, can be used to match the pinched lattice against the entries in the list that needs to be kept private. The large cross-word triphone model in the network is expected to generate a more accurate phonetic lattice.

The concept of a pinched lattice is discussed next. Suppose the 1-best hypothesis of the input utterance is recognized as “send SMS message to _NAME Noah Jones.”

From the entire lattice that generated the best-path hypothesis, the system can take all the paths starting at, or a few milliseconds before, the timestamp of the word “Noah.” This is called a pinched lattice because the system zooms in on a particular portion of the lattice. The system can perform such pinching on the word lattice or the phone lattice, since the start and end times of interest are known.
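Pinching by timestamp can be sketched as a filter over timed arcs. As a simplifying assumption, the lattice is represented here as a flat list of `(start, end, label)` arcs rather than a graph, and `margin` (how many seconds before the word of interest an arc may start) is a hypothetical parameter.

```python
from typing import List, NamedTuple, Optional


class Arc(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    label: str    # word or phone


def pinch(arcs: List[Arc], t_start: float, t_end: Optional[float] = None,
          margin: float = 0.02) -> List[Arc]:
    """Keep the lattice arcs inside a time window of interest.

    Arcs may start up to `margin` seconds before `t_start`; if `t_end` is
    given, they must also end no later than `t_end + margin`.
    """
    lo = t_start - margin
    return [a for a in arcs
            if a.start >= lo and (t_end is None or a.end <= t_end + margin)]
```

Competing arcs such as “Noah” and “Norah” that start at slightly different times both survive the pinch, which is what allows the later matching step to reconsider them.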

The pinched lattice (i.e., the portion of the phonetic lattice) may have several paths. Given a predetermined private list of names and the pinched lattice, along with a phonetic confusion matrix for a particular acoustic model, the system can find the closest match of the lattice with respect to the name list. For example, the system can determine that “Norah Jones” is the highest re-ranked name after this process. The closest-match problem can be formulated as a sequence matching problem if the system expands all the private names into their phonetic transcriptions and performs a match against the pinched lattice.
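The sequence-matching formulation can be illustrated with a plain Smith-Waterman local alignment between phone sequences from the pinched lattice and phonetic transcriptions of the private names. The phone symbols, the confusable pairs, and the +2 / +1 / -1 scores are illustrative assumptions standing in for a real phonetic confusion matrix, and each lattice path is treated as an unweighted phone sequence.

```python
from typing import Dict, List, Sequence

# Illustrative confusable phone pairs; a real system would derive these
# from the acoustic model's phonetic confusion matrix.
CONFUSABLE = {frozenset(("ow", "ao")), frozenset(("m", "n"))}


def sub_score(p: str, q: str) -> float:
    """Similarity of two phones: exact match, confusable pair, or mismatch."""
    if p == q:
        return 2.0
    if frozenset((p, q)) in CONFUSABLE:
        return 1.0
    return -1.0


def smith_waterman(a: Sequence[str], b: Sequence[str], gap: float = -1.0) -> float:
    """Best local-alignment score between two phone sequences."""
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        cur = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cur[j] = max(0.0,
                         prev[j - 1] + sub_score(a[i - 1], b[j - 1]),
                         prev[j] + gap,
                         cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def best_private_match(paths: List[List[str]], lexicon: Dict[str, List[str]]) -> str:
    """Pick the private name whose pronunciation best matches any pinched path."""
    return max(lexicon,
               key=lambda name: max(smith_waterman(p, lexicon[name]) for p in paths))
```

The matching runs entirely on the device, so the private lexicon is never sent; only the pinched lattice has to come back from the network.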

To simplify development and integration work, the embedded recognizer 304 and the network recognizer 310 can also share an API within the device 302. The API can be a software module that accepts requests from application code and routes requests to one or both recognizers as needed. The decision of which recognizer (304 or 310), or both, should be invoked can be transparent to the application that is requesting the speech recognition.
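A shared API of this kind might look like the sketch below. The recognizer callables, the task names in `PRIVATE_TASKS`, and the `is_online` probe are hypothetical. The application calls only `recognize`, and the routing rule stays hidden behind that one method.

```python
from typing import Callable

Recognizer = Callable[[bytes, str], str]  # (audio, task) -> transcript


class HybridSpeechAPI:
    """Single entry point that hides which recognizer serves a request."""

    # Hypothetical tasks that rely on private on-device data.
    PRIVATE_TASKS = frozenset({"voice_dial", "contact_lookup", "car_control"})

    def __init__(self, embedded: Recognizer, network: Recognizer,
                 is_online: Callable[[], bool]) -> None:
        self._embedded = embedded
        self._network = network
        self._is_online = is_online

    def recognize(self, audio: bytes, task: str) -> str:
        # Private tasks, or no connectivity, stay on the embedded recognizer.
        if task in self.PRIVATE_TASKS or not self._is_online():
            return self._embedded(audio, task)
        return self._network(audio, task)
```

Because the routing rule lives behind one method, it could later be replaced (for example, by running both recognizers and combining their outputs) without changing application code.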

A benefit of the solution disclosed herein is that it allows users to retain private data on a mobile device and not share it with a network recognizer. Thus, the user maintains control of sensitive information and need not have a trusted relationship with the network provider. At the same time, the disclosed concepts provide at least some of the benefit of giving a local system access to the network recognizer 310. The concepts also provide a consistent user experience when the network is disconnected and can reduce the required network bandwidth.

FIG. 4 illustrates a method embodiment of this disclosure. As shown in FIG. 4, the method, when practiced by an appropriate system, includes receiving speech from a user at a device communicating via a network with a network-based speech recognition system, wherein the device includes an embedded speech recognition system that accesses private user data local to the device (402). The network-based speech recognition system may also simply be remote from the embedded speech recognition system. For example, the system may not be network-based but simply physically or virtually remote from the embedded system.

The system recognizes a first part of the speech by performing a first recognition of the first part of the speech with the embedded speech recognition system that accesses the private user data, wherein the private user data is not available to the network-based speech recognition system (404). As noted above, the private data can be any type of data on the device, such as contact information data, location data, frequently called numbers or frequently texted individuals, usage history, and so forth. Depending on the type of data and where it is stored, the system can assign different parameters that indicate the level of privacy. For example, if the location data indicates that the user is at work during the day, and that the user is commonly at work during the day, then that data may not have a high level of privacy. However, if the location information is unusual or not part of the standard routine of the individual, then the system may assign a reduced level of privacy and utilize that data or transmit that data to the network-based recognizer for recognition.

The system can recognize a second part of the speech by performing a second recognition of the second part of the speech with the network-based speech recognition system (406).

The first part of the speech can comprise data associated with a personal identification rather than words spoken by the user. In this scenario, the speech received by the system may be “email home and say I will be there in 20 minutes.” No person is identified within the speech. However, the system may know that when the user says email “home,” a contact list or other database can indicate that the user is emailing his wife. Her name can be data that is associated with a personal identification rather than words actually spoken by the user. The user may also choose a recipient and then begin dictating the message to that recipient. For example, the user does not say “text Mom” or “text home,” but rather brings up his wife via a non-verbal method and then dictates “I'll be there in 20 minutes.” The recipient data is not found within the dictation portion of the message, so the recipient information is gathered from another source.

Next, recognizing the first part of the speech and recognizing the second part of the speech can further include receiving from the remote or network-based speech recognition system an entity associated with options to use for speech recognition. The system can evaluate the entity at the embedded speech recognition system in view of the private user data to yield an evaluation and then select a recognition result based on the evaluation. The system can receive data from the remote or network-based speech recognition system which includes a placeholder in place of a name, as noted above. The placeholder can be one of a garbage model, a language model based on a standard list of names, and a statistical language model built from a user's contact list. Similarly, the placeholder could relate to other things such as street names and could also include a garbage model, a language model based on a standard list of street names that are commonly driven or are around the home of a user, or a statistical language model built from addresses either associated with a user's contact list or street names traveled by or near to the user.

FIG. 5 illustrates yet another method embodiment of this disclosure. First, the system receives data comprising a message portion and a user identification portion (502). The system recognizes the message portion and generates message text, inserting a placeholder into the user identification portion (504). The system then transmits the placeholder and the message text to a device, wherein the device extracts audio associated with the placeholder and utilizes private, local data to perform speech recognition on the placeholder to yield second text, and wherein the second text is inserted into the message text (506). An example of the method of FIG. 5 in practice could be the following. Assume that a remote or network-based automatic speech recognition system receives an audio signal or data that includes a message portion and a user identification portion, such as “Hello, Bartholomew, how are you today?” The system would then recognize the message portion but not the user identification portion (Bartholomew). It would generate message text, inserting a placeholder into the user identification portion. For example, the generated message text could be something like “Hello, _NAME_, how are you today?” The system could then transmit the placeholder and the message text to a device such as a handheld smartphone, and the device could extract audio associated with the placeholder. In this case, the remote or network-based system would transmit to the device, in addition to the placeholder, a small portion of audio associated with the name “Bartholomew.” In one variation, the remote or network-based system could transmit multiple names that have been determined to be reasonable matches, or a lattice of possible results, and the local recognizer could use this information, in combination with user data, to determine the final result.

Note that if the device had originally transmitted the entire data to the remote or network-based device, then returning the audio may not be necessary, since the network can return endpoint time markers instead. Some processing can be applied to the audio at the remote or network-based device, such as removing artifacts or background noise, or performing some other processing that can improve the local speech recognition. In any event, the local device then operates on the placeholder and the audio associated with the placeholder, using private, local data to perform speech recognition on that audio. In this case, the user may have a local private contact list that includes the name “Bartholomew.” Speech recognition is therefore performed on the audio associated with the placeholder to yield text representing the audio. This second text is inserted into the overall message to yield the final text associated with the original audio, which is “Hello, Bartholomew, how are you today?” In this manner, the system can blend the use of a more powerful network-based automatic speech recognition system with the available processing power of a smaller local device, while using private local data as part of the speech recognition process for a name.

Of course, the placeholder does not have to relate only to names; it can relate to any words or data that might be part of a private database, stored locally, in a private location in the network, or at some other private storage location. This could include telephone numbers, street names, relationships such as wife, spouse or friend, dates such as birthdates, user history, and so forth. Thus, the placeholder could relate to any other kind of data that is desirable to keep private and for which a user may prefer that automatic speech recognition occur locally, in a manner in which private data can be used to improve the speech recognition process. The user history can include such data as audio or text from previous instructions, previous location information, and communications sent or received by the user, etc. The user history can also include data derived from the user's history, such as a record of frequently used words, bolded or underlined text, or acoustic parameters of the user's voice.

Embodiments within the scope of the present disclosure may also include computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, flash memory, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code means in the form of computer-executable instructions or data structures. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or combination thereof) to a computer, the computer properly views the connection as a computer-readable medium. Thus, any such connection is properly termed a computer-readable medium. Combinations of the above should also be included within the scope of the computer-readable media.

Computer-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Computer-executable instructions also include program modules that are executed by computers in stand-alone or network environments. Generally, program modules include routines, programs, objects, components, and data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of the program code means for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.

Those of skill in the art will appreciate that other embodiments of the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. Embodiments may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination thereof) through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices. A computer-readable device storing instructions expressly excludes energy and signals per se.

Although the above description may contain specific details, they should not be construed as limiting the claims in any way. Other configurations of the described embodiments of the disclosure are part of the scope of this disclosure. For example, the data held locally that is used for speech recognition might be metadata or tags of pictures or video on the local device. Any data that is considered private data can be utilized in the manner disclosed above for processing speech. Accordingly, only the appended claims and their legal equivalents should define the scope of coverage, rather than any specific examples given.

I claim:
 1. A method comprising: receiving speech from a user at a device communicating with a network-based speech recognition system, wherein the device comprises an embedded speech recognition system that accesses private user data local to the device; recognizing a first part of the speech by performing a first recognition of the first part of the speech with the embedded speech recognition system that accesses the private user data, wherein the private user data is not available to the network-based speech recognition system; and recognizing a second part of the speech by performing a second recognition of the second part of the speech with the network-based speech recognition system.
 2. The method of claim 1, further comprising: outputting a recognition result associated with the first recognition or the second recognition, the recognition result comprising text representing the speech, the text being viewed on a display.
 3. The method of claim 1, further comprising: identifying a location of the device.
 4. The method of claim 3, further comprising: determining a privacy level of the private user data according to the location of the device.
 5. The method of claim 1, wherein individual names from a user contact list are from the private user data.
 6. The method of claim 1, wherein the private user data comprises one of data in a user contact list, frequently dialed phone numbers, frequently used texted names, data associated with a user location, data associated with a playlist, user history, and multiple hypotheses associated with private information.
 7. A computer-readable storage device storing instructions which, when executed by a processor, cause the processor to perform operations comprising: receiving speech from a user at a device communicating with a network-based speech recognition system, wherein the device comprises an embedded speech recognition system that accesses private user data local to the device; recognizing a first part of the speech by performing a first recognition of the first part of the speech with the embedded speech recognition system that accesses the private user data, wherein the private user data is not available to the network-based speech recognition system; and recognizing a second part of the speech by performing a second recognition of the second part of the speech with the network-based speech recognition system.
 8. The computer-readable storage device of claim 7, wherein the computer-readable storage device stores additional instructions which, when executed by the processor, cause the processor to perform operations further comprising: outputting a recognition result associated with the first recognition or the second recognition, the recognition result comprising text representing the speech, the text being viewed on a display.
 9. The computer-readable storage device of claim 7, wherein the computer-readable storage device stores additional instructions which, when executed by the processor, cause the processor to perform operations further comprising: identifying a location of the computing device.
 10. The computer-readable storage device of claim 9, wherein the computer-readable storage device stores additional instructions which, when executed by the processor, cause the processor to perform operations further comprising: determining a privacy level of the private user data according to the location of the computing device.
 11. The computer-readable storage device of claim 7, wherein individual names from a user contact list are from the private user data.
 12. The computer-readable storage device of claim 7, wherein the private user data comprises one of data in a user contact list, frequently dialed phone numbers, frequently used texted names, data associated with a user location, data associated with a playlist, user history, and multiple hypotheses associated with private information.
 13. A system comprising: a processor; and a computer-readable storage medium storing instructions which, when executed by the processor, cause the processor to perform operations comprising: receiving speech from a user at a device communicating with a network-based speech recognition system, wherein the device comprises an embedded speech recognition system that accesses private user data local to the device; recognizing a first part of the speech by performing a first recognition of the first part of the speech with the embedded speech recognition system that accesses the private user data, wherein the private user data is not available to the network-based speech recognition system; and recognizing a second part of the speech by performing a second recognition of the second part of the speech with the network-based speech recognition system.
 14. The system of claim 13, wherein the computer-readable storage medium stores additional instructions which, when executed by the processor, cause the processor to perform operations further comprising: outputting a recognition result associated with the first recognition or the second recognition, the recognition result comprising text representing the speech, the text being viewed on a display.
 15. The system of claim 13, wherein the computer-readable storage medium stores additional instructions which, when executed by the processor, cause the processor to perform operations further comprising: identifying a location of the system.
 16. The system of claim 15, wherein the computer-readable storage medium stores additional instructions which, when executed by the processor, cause the processor to perform operations further comprising: determining a privacy level of the private user data according to the location of the system.
 17. The system of claim 13, wherein individual names from a user contact list are from the private user data.
 18. The system of claim 13, wherein the private user data comprises one of data in a user contact list, frequently dialed phone numbers, frequently used texted names, data associated with a user location, data associated with a playlist, user history, and multiple hypotheses associated with private information.